Dilated Recurrent Neural Networks

Neural Information Processing Systems

Learning with recurrent neural networks (RNNs) on long sequences is a notoriously difficult task. There are three major challenges: 1) complex dependencies, 2) vanishing and exploding gradients, and 3) efficient parallelization. In this paper, we introduce a simple yet effective RNN connection structure, the DilatedRNN, which simultaneously tackles all of these challenges. The proposed architecture is characterized by multi-resolution dilated recurrent skip connections and can be combined flexibly with diverse RNN cells. Moreover, the DilatedRNN reduces the number of parameters needed and enhances training efficiency significantly, while matching state-of-the-art performance (even with standard RNN cells) in tasks involving very long-term dependencies. To provide a theory-based quantification of the architecture's advantages, we introduce a memory capacity measure, the mean recurrent length, which is more suitable for RNNs with long skip connections than existing measures. We rigorously prove the advantages of the DilatedRNN over other recurrent neural architectures.
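The key structural idea is that a dilated recurrent layer replaces the usual dependency on h(t-1) with a skip to h(t-d), and layers are stacked with increasing dilations. A minimal NumPy sketch under that reading of the abstract (the `dilated_rnn_layer` name and the plain tanh cell are illustrative, not the paper's code):

```python
import numpy as np

def dilated_rnn_layer(xs, dilation, W_x, W_h, b):
    """One layer with a dilated recurrent skip connection:
    h[t] depends on h[t - dilation] instead of h[t - 1]."""
    hidden = W_h.shape[0]
    hs = []
    for t, x in enumerate(xs):
        h_prev = hs[t - dilation] if t >= dilation else np.zeros(hidden)
        hs.append(np.tanh(W_x @ x + W_h @ h_prev + b))
    return hs

# Toy usage: stack 3 layers with exponentially increasing dilations 1, 2, 4.
rng = np.random.default_rng(0)
seq = [rng.standard_normal(8) for _ in range(16)]
for d in (1, 2, 4):
    seq = dilated_rnn_layer(seq, d,
                            rng.standard_normal((8, 8)) * 0.1,
                            rng.standard_normal((8, 8)) * 0.1,
                            np.zeros(8))
print(len(seq))  # 16
```

Because the cell at dilation d only looks back d steps along its own layer, gradients reach distant timesteps through far fewer nonlinear transformations, which is the mechanism behind the paper's mean-recurrent-length argument.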


ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

Danieli, Federico, Rodriguez, Pau, Sarabia, Miguel, Suau, Xavier, Zappella, Luca

arXiv.org Artificial Intelligence

Since its introduction by Vaswani et al. (2023), the Transformer architecture has quickly imposed itself as the de-facto choice for sequence modeling, surpassing previous state-of-the-art, RNN-based models such as GRU and LSTM (Cho et al., 2014; Hochreiter & Schmidhuber, 1997). One key reason behind the rapid adoption of Transformers lies in the efficiency of their application at training time: their core sequence mixer, the attention mechanism, can in fact be applied in parallel along the length of the input sequence. This effectively overcomes one main limitation of classical RNNs, whose application must be unrolled sequentially along the input sequence. In more recent times, however, interest in RNNs has been rekindled, largely due to their reduced memory footprint and improved efficiency at inference time. In particular, recent advancements in State Space Models (SSMs) such as Mamba (Gu & Dao, 2023; Dao & Gu, 2024) have started gaining popularity and are imposing themselves as a potential alternative to Transformers, at least for small-to-mid-sized models (Zuo et al., 2024). To grant parallelization during training (and hence a comparable performance to attention), SSMs simplify the recurrence relationship at their core to one that is purely linear in the hidden state. This simplification enables leveraging associativity and parallel reduction operations to quickly compute the output of the application of an SSM to a whole input sequence in parallel along its length. Despite the success of modern SSMs, the linearity constraint remains a limitation which hinders their expressive power (Merrill et al., 2025; Cirone et al., 2025), and is dictated by necessity rather than choice. With ParaRNN, we aim to overcome the constraint of linearity for SSMs, while unlocking parallelization also for nonlinear RNNs, thus enriching the space of viable options for sequence modeling.
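The parallelization trick that linear SSMs exploit, and that ParaRNN aims to generalize, can be made concrete: a linear recurrence h_t = a_t * h_(t-1) + b_t is a composition of affine maps, and affine-map composition is associative, so prefix results can be combined in any order by a parallel scan. A minimal NumPy sketch (the reduction here runs sequentially for clarity; function names are illustrative):

```python
import numpy as np

def combine(p, q):
    """Associative operator for h_t = a_t * h_{t-1} + b_t.
    Composing (a_p, b_p) then (a_q, b_q) yields (a_q*a_p, a_q*b_p + b_q)."""
    a_p, b_p = p
    a_q, b_q = q
    return (a_q * a_p, a_q * b_p + b_q)

def scan_linear_recurrence(a, b):
    """Inclusive scan over the affine maps; because `combine` is associative,
    a real implementation can evaluate this as a parallel reduction tree."""
    out = [(a[0], b[0])]
    for t in range(1, len(a)):
        out.append(combine(out[-1], (a[t], b[t])))
    return np.array([h for _, h in out])

a = np.array([0.5, 0.9, 1.1, 0.7])
b = np.array([1.0, -2.0, 0.5, 3.0])

# Sequential reference recurrence, starting from h = 0.
h, seq = 0.0, []
for at, bt in zip(a, b):
    h = at * h + bt
    seq.append(h)
assert np.allclose(scan_linear_recurrence(a, b), seq)
```

A nonlinear cell h_t = f(h_(t-1), x_t) admits no such associative decomposition, which is exactly why the linearity constraint exists and why lifting it for parallel training is nontrivial.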


Lattice Protein Folding with Variational Annealing

Khandoker, Shoummo Ahsan, Inack, Estelle M., Hibat-Allah, Mohamed

arXiv.org Artificial Intelligence

Understanding the principles of protein folding is a cornerstone of computational biology, with implications for drug design, bioengineering, and the understanding of fundamental biological processes. Lattice protein folding models offer a simplified yet powerful framework for studying the complexities of protein folding, enabling the exploration of energetically optimal folds under constrained conditions. However, finding these optimal folds is a computationally challenging combinatorial optimization problem. In this work, we introduce a novel upper-bound training scheme that employs masking to identify the lowest-energy folds in two-dimensional Hydrophobic-Polar (HP) lattice protein folding. By leveraging Dilated Recurrent Neural Networks (RNNs) integrated with an annealing process driven by temperature-like fluctuations, our method accurately predicts optimal folds for benchmark systems of up to 60 beads. Our approach also effectively masks invalid folds from being sampled without compromising the autoregressive sampling properties of RNNs. This scheme is generalizable to three spatial dimensions and can be extended to lattice protein models with larger alphabets. Our findings emphasize the potential of advanced machine learning techniques in tackling complex protein folding problems and a broader class of constrained combinatorial optimization challenges.


Fast-Slow Recurrent Neural Networks

Asier Mujika, Florian Meier, Angelika Steger

Neural Information Processing Systems

Processing sequential data of variable length is a major challenge in a wide range of applications, such as speech recognition, language modeling, generative image modeling and machine translation. Here, we address this challenge by proposing a novel recurrent neural network (RNN) architecture, the Fast-Slow RNN (FS-RNN). The FS-RNN incorporates the strengths of both multiscale RNNs and deep transition RNNs, as it processes sequential data on different timescales and learns complex transition functions from one time step to the next. We evaluate the FS-RNN on two character-level language modeling datasets, Penn Treebank and Hutter Prize Wikipedia, where we improve state-of-the-art results to 1.19 and 1.25 bits-per-character (BPC), respectively. In addition, an ensemble of two FS-RNNs achieves 1.20 BPC on Hutter Prize Wikipedia, outperforming the best known compression algorithm with respect to the BPC measure. We also present an empirical investigation of the learning and network dynamics of the FS-RNN, which explains the improved performance compared to other RNN architectures. Our approach is general, as any kind of RNN cell is a possible building block for the FS-RNN architecture, and thus it can be flexibly applied to different tasks.
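The fast-slow structure can be sketched as several fast cells applied sequentially within each timestep, with a slow cell updating once in between. An illustrative NumPy toy under that reading (plain tanh cells with matched dimensions for simplicity; the paper itself uses LSTM cells, and `fs_rnn_step` is a hypothetical name):

```python
import numpy as np

def rnn_cell(x, h, W_x, W_h):
    """A plain tanh cell standing in for the paper's LSTM cells."""
    return np.tanh(W_x @ x + W_h @ h)

def fs_rnn_step(x, h_fast, h_slow, fast_params, slow_params, k=2):
    """One FS-RNN timestep: the first fast cell consumes the input, the slow
    cell updates once per timestep, then the remaining fast cells consume the
    slow state, so information is mixed across timescales within one step."""
    Wxf, Whf = fast_params
    Wxs, Whs = slow_params
    h_fast = rnn_cell(x, h_fast, Wxf, Whf)        # fast cell 1: sees the input
    h_slow = rnn_cell(h_fast, h_slow, Wxs, Whs)   # slow cell: once per step
    for _ in range(k - 1):                        # fast cells 2..k
        h_fast = rnn_cell(h_slow, h_fast, Wxf, Whf)
    return h_fast, h_slow

# Toy usage over a random sequence with hidden size 4.
rng = np.random.default_rng(0)
n = 4
fast = (rng.standard_normal((n, n)) * 0.1, rng.standard_normal((n, n)) * 0.1)
slow = (rng.standard_normal((n, n)) * 0.1, rng.standard_normal((n, n)) * 0.1)
h_f, h_s = np.zeros(n), np.zeros(n)
for x in rng.standard_normal((10, n)):
    h_f, h_s = fs_rnn_step(x, h_f, h_s, fast, slow)
```

Running the fast chain k times per input step is what gives the "deep transition" behavior, while the once-per-step slow cell provides the coarse timescale.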


Supplementing Recurrent Neural Networks with Annealing to Solve Combinatorial Optimization Problems

Khandoker, Shoummo Ahsan, Abedin, Jawaril Munshad, Hibat-Allah, Mohamed

arXiv.org Artificial Intelligence

Combinatorial optimization problems can be solved by heuristic algorithms such as simulated annealing (SA) which aims to find the optimal solution within a large search space through thermal fluctuations. The algorithm generates new solutions through Markov-chain Monte Carlo techniques. This sampling scheme can result in severe limitations, such as slow convergence and a tendency to stay within the same local search space at small temperatures. To overcome these shortcomings, we use the variational classical annealing (VCA) framework [1] that combines autoregressive recurrent neural networks (RNNs) with traditional annealing to sample solutions that are uncorrelated. In this paper, we demonstrate the potential of using VCA as an approach to solving real-world optimization problems. We explore VCA's performance in comparison with SA at solving three popular optimization problems: the maximum cut problem (Max-Cut), the nurse scheduling problem (NSP), and the traveling salesman problem (TSP). For all three problems, we find that VCA outperforms SA on average in the asymptotic limit by one or more orders of magnitude in terms of relative error. Interestingly, we reach large system sizes of up to 256 cities for the TSP. We also conclude that in the best case scenario, VCA can serve as a great alternative when SA fails to find the optimal solution.


DartsReNet: Exploring new RNN cells in ReNet architectures

Moser, Brian, Raue, Federico, Hees, Jörn, Dengel, Andreas

arXiv.org Artificial Intelligence

We present new Recurrent Neural Network (RNN) cells for image classification using a Neural Architecture Search (NAS) approach called DARTS. We are interested in the ReNet architecture, an RNN-based approach presented as an alternative to convolutional and pooling layers. ReNet can be defined using any standard RNN cell, such as LSTM or GRU. One limitation is that standard RNN cells were designed for one-dimensional sequential data rather than the two dimensions of image classification. We overcome this limitation by using DARTS to find new cell designs. We compare our results with ReNet using GRU and LSTM cells. The cells we found outperform the standard RNN cells on CIFAR-10 and SVHN. The improvements on SVHN indicate generalizability, as we derived the RNN cell designs from CIFAR-10 without performing a new cell search for SVHN.


Minimal Width for Universal Property of Deep RNN

Song, Chang hoon, Hwang, Geonho, Lee, Jun ho, Kang, Myungjoo

arXiv.org Artificial Intelligence

A recurrent neural network (RNN) is a widely used deep-learning network for dealing with sequential data. Imitating a dynamical system, an infinite-width RNN can approximate any open dynamical system in a compact domain. In general, deep networks with bounded widths are more effective than wide networks in practice; however, the universal approximation theorem for deep narrow structures has yet to be extensively studied. In this study, we prove the universality of deep narrow RNNs and show that the upper bound of the minimum width for universality can be independent of the length of the data. Specifically, we show that a deep RNN with ReLU activation can approximate any continuous function or $L^p$ function with the widths $d_x+d_y+2$ and $\max\{d_x+1,d_y\}$, respectively, where the target function maps a finite sequence of vectors in $\mathbb{R}^{d_x}$ to a finite sequence of vectors in $\mathbb{R}^{d_y}$. We also compute the additional width required if the activation function is $\tanh$ or more. In addition, we prove the universality of other recurrent networks, such as bidirectional RNNs. Bridging a multi-layer perceptron and an RNN, our theory and proof technique can be an initial step toward further research on deep RNNs.


Building A Recurrent Neural Network From Scratch In Python

#artificialintelligence

We can think of the RNN model shown in figure 1 as repeated use of the single cell shown in figure 2. First, we will implement a single cell, and then we can loop through it, stacking these single cells over each other to create the forward pass of the RNN model. The basic RNN cell takes as input the current data point and the hidden state from the previous time step, and produces the next hidden state along with an output prediction. Let's implement the RNN cell shown in figure 2 through four main steps, starting with the softmax activation function:
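A sketch of those steps (variable names follow the common Waa/Wax/Wya convention for a basic RNN cell; this is not necessarily the post's exact code):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, params):
    """One forward step of a basic RNN cell:
    a_next = tanh(Waa @ a_prev + Wax @ xt + ba),  yt = softmax(Wya @ a_next + by)."""
    Waa, Wax, Wya, ba, by = (params[k] for k in ("Waa", "Wax", "Wya", "ba", "by"))
    a_next = np.tanh(Waa @ a_prev + Wax @ xt + ba)
    yt = softmax(Wya @ a_next + by)
    return a_next, yt

# Looping the cell over time steps gives the full forward pass.
rng = np.random.default_rng(1)
n_x, n_a, n_y, T = 3, 5, 2, 4
params = {"Waa": rng.standard_normal((n_a, n_a)) * 0.1,
          "Wax": rng.standard_normal((n_a, n_x)) * 0.1,
          "Wya": rng.standard_normal((n_y, n_a)) * 0.1,
          "ba": np.zeros((n_a, 1)), "by": np.zeros((n_y, 1))}
a = np.zeros((n_a, 1))
for t in range(T):
    a, y = rnn_cell_forward(rng.standard_normal((n_x, 1)), a, params)
print(np.isclose(y.sum(), 1.0))  # softmax outputs sum to 1
```

The same `a` returned at each step is fed back in as `a_prev` at the next step, which is the "loop through it" part of the forward pass.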


An Improved Time Feedforward Connections Recurrent Neural Networks

Wang, Jin, Zou, Yongsong, Lim, Se-Jung

arXiv.org Artificial Intelligence

Recurrent Neural Networks (RNNs) have been widely applied to temporal problems such as flood forecasting and financial data processing. On the one hand, traditional RNN models amplify the gradient issue due to their strict serial time dependency, making it difficult to realize a long-term memory function. On the other hand, RNN cells are highly complex, which significantly increases computational complexity and wastes computational resources during model training. In this paper, an improved Time Feedforward Connections Recurrent Neural Networks (TFC-RNNs) model is first proposed to address the gradient issue. A parallel branch is introduced so that the hidden state at time t-2 is transferred directly to time t without the nonlinear transformation at time t-1. This is effective in improving the long-term dependence of RNNs. Then, a novel cell structure named the Single Gate Recurrent Unit (SGRU) is presented. This cell structure reduces the number of parameters in the RNN cell, consequently reducing computational complexity. Next, applying SGRU to TFC-RNNs as a new TFC-SGRU model addresses both difficulties. Finally, the performance of the proposed TFC-SGRU is verified through several experiments in terms of long-term memory and anti-interference capabilities. Experimental results demonstrate that the TFC-SGRU model can capture helpful information over a span of 1500 time steps and effectively filter out noise. The TFC-SGRU model achieves better accuracy than the LSTM and GRU models on language-processing tasks.
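The time-feedforward connection described above can be sketched directly: the state from t-2 feeds into the update at t through its own weight matrix, bypassing the nonlinearity applied at t-1. A minimal NumPy illustration under that reading (the `tfc_rnn_forward` name and the plain tanh cell are illustrative, not the paper's SGRU):

```python
import numpy as np

def tfc_rnn_forward(xs, W_x, W_h1, W_h2, b):
    """Forward pass with a time-feedforward connection: h[t] receives h[t-2]
    directly (via W_h2), in addition to the usual h[t-1] path (via W_h1)."""
    n = W_h1.shape[0]
    h_prev1, h_prev2 = np.zeros(n), np.zeros(n)
    hs = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h1 @ h_prev1 + W_h2 @ h_prev2 + b)
        h_prev2, h_prev1 = h_prev1, h  # shift the two-step state window
        hs.append(h)
    return hs

# Toy usage: a 6-step sequence of 3-dimensional inputs, hidden size 4.
rng = np.random.default_rng(0)
hs = tfc_rnn_forward(rng.standard_normal((6, 3)),
                     rng.standard_normal((4, 3)) * 0.1,
                     rng.standard_normal((4, 4)) * 0.1,
                     rng.standard_normal((4, 4)) * 0.1,
                     np.zeros(4))
```

Because the h[t-2] term enters the update without passing through the tanh at t-1, gradients flowing along that branch skip one nonlinearity per two steps, which is the mechanism behind the claimed improvement in long-term dependence.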